Adds a new `stringAgg` aggregate function that concatenates string values within groups with a configurable separator and ordering. Uses a two-tier architecture: a stateful incremental path with O(log n) binary search for efficient delta updates, and a stateless fallback for simpler use cases. Includes query builder overloads, compiler integration, docs, and comprehensive test coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Size Change: +272 B (+0.25%) Total Size: 110 kB

Size Change: 0 B Total Size: 4.23 kB
PR #1382 Review: stringAgg aggregate function
Overall assessment
The implementation is sound for its intended use cases and the tests are thorough. But there are real architectural concerns, one potential correctness issue, and some unnecessary complexity worth discussing.
1. It's not truly incremental — the reduce function is O(n) every time
The reduce function rebuilds nextEntriesByKey by scanning all entries in the Index for the group on every invocation:
```ts
// groupBy.ts ~line 430-457
const nextEntriesByKey = new Map<...>()
for (const [entry, multiplicity] of values) { // ← ALL accumulated values
  if (entry.rowKey == null || multiplicity <= 0 || entry.value == null) continue
  nextEntriesByKey.set(entry.rowKey, {...})
}
```

The "incremental" benefit is only in steps 2 and 3 — maintaining the sorted array and doing fast-path text splicing. Step 1 (building the target state) is always O(k), where k = entries in the Index for that group.
This is a limitation of the ReduceOperator contract (it always passes all accumulated values), so it's not something stringAgg can easily avoid. But the PR description's emphasis on "O(log n) binary search" and "fast-path string slicing" is somewhat misleading — those optimizations only help after the O(k) scan has already happened.
The good news: the Index does consolidate entries via content hashing (MurmurHash in packages/db-ivm/src/hashing/hash.ts). When a row is removed, the +1 and -1 entries for the same content cancel out and are deleted. So k ≈ number of active rows in the group, not historical total. This makes the O(k) scan reasonable in practice.
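A minimal model of that cancellation behavior (illustrative only; the real Index hashes content with MurmurHash rather than JSON-stringifying it, and the names here are hypothetical):

```typescript
// Toy sketch of multiset consolidation by content key: a +1 and a -1 for
// identical content cancel out and the entry is deleted, so the per-group
// entry count tracks active rows, not historical churn.
type Entry = { rowKey: string; value: string }

function consolidate(
  changes: Array<[Entry, number]>
): Map<string, { entry: Entry; multiplicity: number }> {
  const byContent = new Map<string, { entry: Entry; multiplicity: number }>()
  for (const [entry, multiplicity] of changes) {
    const key = JSON.stringify(entry) // stand-in for content hashing
    const existing = byContent.get(key)
    const next = (existing?.multiplicity ?? 0) + multiplicity
    if (next === 0) byContent.delete(key) // insert + delete cancel out
    else byContent.set(key, { entry, multiplicity: next })
  }
  return byContent
}

const removed: Entry = { rowKey: `r1`, value: `hello` }
const state = consolidate([
  [removed, 1], // insert
  [{ rowKey: `r2`, value: `world` }, 1],
  [removed, -1], // later delete cancels the earlier insert entirely
])
// state now holds a single entry, for r2
```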
2. Correctness depends on a fragile invariant: "last positive-multiplicity entry wins"
This is my biggest concern. The code builds the target state by iterating Index entries and setting nextEntriesByKey[rowKey] for each positive-multiplicity entry, with last-write-wins:
```ts
for (const [entry, multiplicity] of values) {
  if (multiplicity <= 0 ...) continue
  nextEntriesByKey.set(entry.rowKey, {...}) // last positive wins
}
```

Compare this to how existing aggregates work:
- `sum`: `total += value * multiplicity` — algebraically correct regardless of entry order, handles negative multiplicities naturally
- `count`: `totalCount += nullMultiplier * multiplicity` — same: algebraic
- `stringAgg`: skips negative multiplicities entirely — correct only if the Index properly consolidates
If the Index ever fails to consolidate (e.g., hash collision between two different pre-mapped objects, or a bug in consolidation), stringAgg would silently include ghost entries that should have been removed. The algebraic aggregates would still be correct because value * (+1) + value * (-1) = 0.
This is a real fragility difference. The hash function is 32-bit MurmurHash — collisions are rare but not impossible, and when they happen, stringAgg would be the first aggregate to produce visibly wrong results.
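A toy illustration of the difference (hypothetical values, not the library's code):

```typescript
// Suppose consolidation failed and both the +1 and the -1 entry for a
// removed "ghost" value survive in the accumulated values.
const changes: Array<[string, number]> = [
  [`a`, 1],
  [`ghost`, 1],
  [`ghost`, -1], // removal entry that should have cancelled the +1
]

// Algebraic style (like count): multiplicities cancel, result stays correct.
const algebraicCount = changes.reduce((acc, [, m]) => acc + m, 0) // 1

// Skip-negatives style (like stringAgg's scan): the -1 is ignored,
// so the ghost value is silently kept in the result.
const kept = new Set<string>()
for (const [value, m] of changes) {
  if (m <= 0) continue
  kept.add(value)
}
// algebraicCount === 1, but kept contains both "a" and "ghost"
```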
3. Array.splice() is O(n) — the binary search doesn't help as much as claimed
The sorted array maintenance uses splice for both insert and remove:
```ts
// Insert
state.orderedEntries.splice(index, 0, entry) // O(n) array shift

// Remove
state.orderedEntries.splice(index, 1) // O(n) array shift
```

The O(log n) binary search finds the position, but the O(n) splice dominates. For a group with 10,000 entries, each insert/remove shifts thousands of array elements. A balanced BST or skip list would give true O(log n) insert/remove, but that's probably over-engineering for typical group sizes.
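The pattern in question, sketched with illustrative helper names (not the PR's actual code):

```typescript
// Classic lower-bound binary search: O(log n) probes to find the slot
// where `target` belongs in a sorted array.
function lowerBound(arr: Array<number>, target: number): number {
  let lo = 0
  let hi = arr.length
  while (lo < hi) {
    const mid = (lo + hi) >>> 1
    if (arr[mid]! < target) lo = mid + 1
    else hi = mid
  }
  return lo
}

const ordered: Array<number> = [1, 3, 5, 7]
// The search is logarithmic, but splice still shifts every element after
// the insertion point, so the overall insert is O(n).
ordered.splice(lowerBound(ordered, 4), 0, 4)
// ordered is now [1, 3, 4, 5, 7]
```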
4. Fast-path text splicing: clever but has wasted work
The head/tail text splicing optimization is well-implemented — it correctly uses exact value lengths rather than searching for separator positions, so it handles values containing the separator string correctly (as tested).
However, when textDirty is already true (from a middle-position change), subsequent remove/insert operations still perform fast-path string modifications that will be thrown away:
```ts
// If an earlier change set textDirty = true, this string work is wasted
if (index === entryCount - 1) {
  state.text = state.text.slice(0, state.text.length - suffixLength)
  return false // says "no rebuild needed" but textDirty is already true
}
```

Not a correctness issue — the rebuild at the end produces the right result. But it's unnecessary string allocation for batch updates that include a middle-position change.
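One possible cheap fix, sketched with hypothetical names (not the PR's code): guard the fast path on `textDirty` so a pending rebuild short-circuits the string work.

```typescript
// If a full rebuild is already pending, fast-path splicing is wasted
// allocation; bail out before touching the text.
type StringAggState = { text: string; textDirty: boolean }

function removeSuffix(state: StringAggState, suffixLength: number): void {
  if (state.textDirty) return // rebuild pending: skip the string work
  state.text = state.text.slice(0, state.text.length - suffixLength)
}

const state: StringAggState = { text: `a,b,c`, textDirty: true }
removeSuffix(state, 2)
// state.text is untouched; the eventual rebuild recomputes it anyway
```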
5. The stateful closure pattern is novel and creates lifecycle coupling
This is the first aggregate to use closure-based external state (groupStates Map). All other aggregates (sum, count, avg, min, max, median, mode) are stateless — they compute the result from the full set of values each time.
The stateful pattern requires the new cleanup callback:
```ts
type BasicAggregateFunction<T, R, V = unknown, Reduced = V> = {
  preMap: (data: T) => V
  reduce: (values: Array<[V, number]>, groupKey: string) => Reduced
  postMap?: (result: Reduced) => R
  cleanup?: (groupKey: string) => void // NEW: only needed by stringAgg
}
```

Plus the reduce function signature change to pass `groupKey` (also new). These are framework-level changes to support a single aggregate. The cleanup is correctly called when `totalMultiplicity <= 0`, which handles group deletion. But this creates a coupling between the aggregate's lifecycle and the reduce operator's — if the graph is ever recreated or the reduce operator is reset without calling cleanup, the stale `groupStates` entries would cause incorrect results.
6. Minor issues in the compiler integration
In packages/db/src/query/compiler/group-by.ts, the stringagg case disambiguates separator vs orderBy by checking expression type:
```ts
const separator =
  separatorOrOrderByExpr?.type === `val` &&
  typeof separatorOrOrderByExpr.value === `string`
    ? separatorOrOrderByExpr.value
    : ``
```

This means `stringAgg(col, "")` (an empty-string separator) works, but `stringAgg(col, someStringColumn)` treats `someStringColumn` as an orderBy expression rather than a separator (since it's a ref, not a val). That's the correct behavior per the API design — but it means you can't use a column reference as a dynamic separator. Worth documenting.
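The disambiguation rule can be sketched as follows (illustrative expression types and helper name, not the compiler's actual IR):

```typescript
// Literal string second argument => separator; anything else (including a
// ref to a string column) falls through to the default separator and is
// treated as an orderBy expression instead.
type Expr =
  | { type: `val`; value: unknown }
  | { type: `ref`; path: Array<string> }

function resolveSeparator(expr: Expr | undefined): string {
  return expr?.type === `val` && typeof expr.value === `string`
    ? expr.value
    : `` // refs and missing args get the empty-string default
}

// resolveSeparator({ type: `val`, value: `, ` })   -> ", "
// resolveSeparator({ type: `ref`, path: [`sep`] }) -> "" (treated as orderBy)
```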
7. The compareStringAggOrderValues comparison works but has edge cases
Comparing bigint and number via < / > works in JS. Comparing booleans gives false < true. These are fine. But comparing string vs number (e.g., if a user accidentally passes mixed types) gives unpredictable results: JS coerces the string to a number, which is NaN for non-numeric strings, and every comparison against NaN is false. The type system should prevent this in practice, but there's no runtime guard.
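A runnable illustration of the failure mode (the cast only silences the type checker, which would normally reject this comparison):

```typescript
// A non-numeric string coerces to NaN, so both relational checks are
// false and a </>-based comparator reports the values as "equal".
const mixed = `abc` as unknown as number
const lt = mixed < 5 // false: NaN < 5
const gt = mixed > 5 // false: NaN > 5
const cmp = lt ? -1 : gt ? 1 : 0 // 0 — "equal", which can corrupt sort order
```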
Summary
| Aspect | Assessment |
|---|---|
| Correctness | Sound under normal conditions; fragile vs hash collisions |
| Incremental benefit | Real but limited — O(k) scan is unavoidable, splice is O(n) |
| Best use case | Append-only streams (LLM chunks) where tail-insert fast-path shines |
| Code complexity | High for the benefit; the stateful pattern + cleanup adds framework-level changes |
| Test coverage | Excellent — covers inserts, removes, updates, reordering, group deletion/recreation, separator-in-value, fallback path |
The main question I have: is the stateful incremental approach worth the complexity? For the LLM streaming use case (appending to the end), the fast-path tail insert avoids an O(total_text_length) rebuild per chunk, which is genuinely valuable. But for other use cases (random inserts/removals in the middle), it degrades to a full rebuild anyway, and the O(k) scan + O(n) splice overhead means it's not dramatically faster than a simple "sort and join" on each call.
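For reference, the "simple sort and join" baseline amounts to something like this (a sketch with an assumed row shape, not code from the PR):

```typescript
// Stateless alternative: on every update, sort the group's active rows by
// their order key and join. O(k log k + total text length) per call, with
// no per-group state to maintain or clean up.
type Row = { orderKey: number; value: string }

function stringAggSimple(rows: Array<Row>, separator: string): string {
  return [...rows]
    .sort((a, b) => a.orderKey - b.orderKey)
    .map((row) => row.value)
    .join(separator)
}

const result = stringAggSimple(
  [
    { orderKey: 2, value: `world` },
    { orderKey: 1, value: `hello` },
  ],
  `, `
)
// result === "hello, world"
```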
Summary
Adds a `stringAgg` aggregate function that concatenates string values within groups, with configurable separators and ordering. Available across the full stack: query builder, compiler, and IVM engine.

Approach
Two-tier IVM architecture:
- Stateful incremental path (with `rowKeyExtractor`): maintains per-group sorted state with O(log n) binary search for efficient delta updates
- Stateless fallback (no `rowKeyExtractor`): simple sort-and-join on each update. Used when row identity isn't available.

Query builder overloads disambiguate the flexible API surface.
The compiler distinguishes separator (literal string) from orderBy (column reference) by expression type, and always provides a `rowKeyExtractor` to enable the incremental path.

Key invariants
- `entriesByKey` and `orderedEntries` must stay synchronized — `removeStringAggEntry` now throws on desynchronization rather than silently returning
- Middle-position changes set `textDirty` for a full rebuild
- Null values are skipped (Postgres `string_agg` semantics)

Non-goals
Verification
All 140 tests pass, including new coverage for:

- inserts, removes, updates, and reordering within groups
- group deletion and recreation
- separator strings appearing inside values
- the fallback path (no `rowKeyExtractor`)

Files changed
- `packages/db-ivm/src/operators/groupBy.ts` — `stringAgg` implementation with binary search, incremental text maintenance, and fallback path
- `packages/db-ivm/src/operators/reduce.ts` — `groupKey` parameter and `cleanup` callback support
- `packages/db/src/query/builder/functions.ts` — `stringAgg` builder with 4 overloads and `OrderByLike` type
- `packages/db/src/query/compiler/group-by.ts` — compiler integration for `stringAgg`
- `packages/db/src/query/index.ts` — `stringAgg`
- `docs/guides/live-queries.md` — `stringAgg` docs

🤖 Generated with Claude Code